Goto

Collaborating Authors

 section 6


On the Role of Batch Size in Stochastic Conditional Gradient Methods

Islamov, Rustem, Machacek, Roman, Lucchi, Aurelien, Silveti-Falls, Antonio, Gorbunov, Eduard, Cevher, Volkan

arXiv.org Machine Learning

We study the role of batch size in stochastic conditional gradient methods under a $μ$-Kurdyka-Łojasiewicz ($μ$-KL) condition. Focusing on momentum-based stochastic conditional gradient algorithms (e.g., Scion), we derive a new analysis that explicitly captures the interaction between stepsize, batch size, and stochastic noise. Our study reveals a regime-dependent behavior: increasing the batch size initially improves optimization accuracy but, beyond a critical threshold, the benefits saturate and can eventually degrade performance under a fixed token budget. Notably, the theory predicts the magnitude of the optimal stepsize and aligns well with empirical practices observed in large-scale training. Leveraging these insights, we derive principled guidelines for selecting the batch size and stepsize, and propose an adaptive strategy that increases batch size and sequence length during training while preserving convergence guarantees. Experiments on NanoGPT are consistent with the theoretical predictions and illustrate the emergence of the predicted scaling regimes. Overall, our results provide a theoretical framework for understanding batch size scaling in stochastic conditional gradient methods and offer guidance for designing efficient training schedules in large-scale optimization.


On the Peril of (Even a Little) Nonstationarity in Satisficing Regret Minimization

Zhang, Yixuan, Zhu, Ruihao, Xie, Qiaomin

arXiv.org Machine Learning

Motivated by the principle of satisficing in decision-making, we study satisficing regret guarantees for nonstationary $K$-armed bandits. We show that in the general realizable, piecewise-stationary setting with $L$ stationary segments, the optimal regret is $Θ(L\log T)$ as long as $L\geq 2$. This stands in sharp contrast to the case of $L=1$ (i.e., the stationary setting), where a $T$-independent $Θ(1)$ satisficing regret is achievable under realizability. In other words, the optimal regret has to scale with $T$ even if just a little nonstationarity presents. A key ingredient in our analysis is a novel Fano-based framework tailored to nonstationary bandits via a \emph{post-interaction reference} construction. This framework strictly extends the classical Fano method for passive estimation as well as recent interactive Fano techniques for stationary bandits. As a complement, we also discuss a special regime in which constant satisficing regret is again possible.






We present conditional monotonicity results using alternative estimators of performance quality

Neural Information Processing Systems

The Appendix is structured as follows: We provide a proof of conditional guarantees in EENNs for (hard) PoE in Appendix A . We conduct an ablation study for our P A model in Appendix B.2 . We report results of NLP experiments in Appendix B.4 . We discuss anytime regression and deep ensembles in Appendix B.6 . We propose a technique for controlling the violations of conditional monotonicity in P A in Appendix B.8 .